3.1 Encoder and Decoder Stacks

Encoder

"A stack of N = 6 identical layers"

Each layer consists of two sub-layers.

The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

The two sub-layers are the "Multi-Head Attention" and "Feed Forward" blocks in Figure 1.

We employ a residual connection around each of the two sub-layers, followed by layer normalization

This is the "Add & Norm" block that follows each sub-layer in Figure 1.

👉 Add = residual connection, Norm = layer normalization

For the former (the residual connection), Attention Is All You Need cites Deep Residual Learning for Image Recognition (ref. 11).

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.

In Figure 1, the arrow that bypasses each sub-layer shows the input x itself also entering Add & Norm. That is the x term in LayerNorm(x + Sublayer(x)).
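The Add & Norm step can be sketched for a single position vector as follows. This is a minimal illustration, not the paper's implementation: `layer_norm`, `add_and_norm`, and `toy_sublayer` are hypothetical names, the learned gain and bias of layer normalization are omitted, and the toy sub-layer merely stands in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (roughly) unit variance.
    Assumption: the learned gain and bias are left out for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position vector x."""
    y = sublayer(x)
    # The residual addition only works if the sub-layer keeps the dimension.
    assert len(y) == len(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical stand-in for a real sub-layer; it preserves the dimension."""
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

The dimension check inside `add_and_norm` is the reason every sub-layer has to produce outputs of the same size as its input.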

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

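Not only the embedding layers but every sub-layer in the model produces outputs of dimension d_model = 512, so x and Sublayer(x) always have the same shape and can be added.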

Decoder

"A stack of N = 6 identical layers" (same as the encoder)

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

That is, each decoder layer has the two sub-layers of an encoder layer plus a third, inserted sub-layer.

The third sub-layer performs multi-head attention over the output of the encoder stack.

In Figure 1, the middle Multi-Head Attention block on the right-hand side does indeed receive the encoder's output.

K and V come from the encoder output; Q comes from the output of the decoder's first Multi-Head Attention sub-layer (see Section 3.2.3).
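To make the Q / K / V wiring concrete, here is a rough sketch of this encoder-decoder attention using scaled dot-product attention. It assumes a single head and skips the learned projection matrices, so it is a simplification of what the paper describes; `attention`, `encoder_output`, and `decoder_state` are hypothetical names and the numbers are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention, one head, no learned projections."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_state = [[1.0, 0.0], [0.0, 2.0]]               # 2 target positions

# Q comes from the decoder; K and V both come from the encoder output.
context, attn = attention(decoder_state, encoder_output, encoder_output)
print(len(context), len(attn[0]))
```

Each decoder position gets one weight per source position, so `attn` has one row per target position and one column per source position.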

Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

In Figure 1, each of the three sub-layers is followed by Add & Norm.

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.

The self-attention sub-layer in the decoder stack is modified so that a position cannot attend to subsequent (later) positions.

This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

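Combined with the fact that the output embeddings are offset by one position ("offset" here means shifted, i.e. the decoder input is the target sequence shifted right by one), this masking guarantees that the prediction for position i can depend only on the known outputs at positions less than i.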
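The interaction of the mask and the one-position offset can be checked with a small sketch. Everything here is illustrative: `causal_mask`, `shift_right`, `visible_inputs`, and the `<s>` start token are hypothetical names, and a boolean mask stands in for what Section 3.2.3 describes as setting the illegal attention scores to negative infinity before the softmax.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def shift_right(tokens, start="<s>"):
    """Offset the target by one position: prepend a start token, drop the last."""
    return [start] + tokens[:-1]

target = ["I", "am", "here", "</s>"]
decoder_input = shift_right(target)   # what the decoder actually reads
mask = causal_mask(len(target))

def visible_inputs(i):
    """Decoder inputs that the prediction at position i is allowed to see."""
    return [decoder_input[j] for j in range(len(decoder_input)) if mask[i][j]]

for i, token in enumerate(target):
    print(f"predict {token!r} from {visible_inputs(i)}")
```

Because of the shift, the input at position i is the target token from position i - 1, so even though the mask lets position i see itself, the token being predicted is never among the visible inputs.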